White Wine Quality Analysis by Lee Clemmer

In this analysis I investigate which chemical properties influence the quality of white wines. The white wine dataset (4,898 observations) is one part of a larger dataset of white and red vinho verde samples from Portugal.

As I have no knowledge of this domain, I will have to carefully review the data description (see section Univariate Analysis) and let the data speak to me. Going into the analysis, my assumption is that one or more of the chemical properties will correlate strongly with the quality of the wine, whether it is sulfur dioxide, chlorides, or acidity. The descriptions of the features suggest some relationships between them (e.g. density as a function of alcohol and residual sugar content), and those will be natural starting points for investigation.

Univariate Plots Section

## [1] "No. of Observations and No. of Variables"
## [1] 4898   12
## [1] "Variable Names"
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## [1] "Data Structure"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
## [1] "Summary of Variables"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Wine quality ratings are integers between 3 and 9. Only 5 wines out of 4,898 were rated a 9, while only 20 received the lowest score of 3.
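To put the frequency table in relative terms, here is a minimal sketch that hard-codes the counts printed above rather than reading the dataset. The variable names are my own.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of all wines at each score, in percent.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Share of wines rated 5, 6, or 7 (the middle of the observed range).
middle_share <- sum(quality_counts[c("5", "6", "7")]) / sum(quality_counts)
print(round(middle_share, 3))
```

The counts add up to the 4,898 observations reported earlier, and the three middle scores account for the large majority of wines.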

##            n
## 1 0.08922009

pH values are normally distributed around the mean of 3.188. Almost 9% fall below a pH of 3. Since pH describes how acidic or basic something is, I wonder if there is a tight relationship between pH and the other acidity-related properties.
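The share below a threshold can be computed as the mean of a logical vector. The sketch below uses a hypothetical helper name (`share_below`) and a made-up toy vector to show the call, then converts the reported share back into a count of wines.

```r
# Fraction of values strictly below a threshold; NA values are ignored.
share_below <- function(x, threshold) {
  mean(x < threshold, na.rm = TRUE)
}

# Hypothetical toy vector, only to show the call; not taken from the dataset.
toy_ph <- c(2.9, 3.0, 3.1)
print(share_below(toy_ph, 3))

# Converting the reported share back into a number of wines.
n_wines <- 4898
reported_share <- 0.08922009
print(round(reported_share * n_wines))
```

The comparison is assumed to be strict, so a wine with a pH of exactly 3 would not be counted.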

I increased the number of bins to get better resolution. Fixed acidity looks normally distributed around the mean of 6.855, with a couple of outliers beyond 11.

Volatile acidity is mostly normally distributed around the mean of 0.2782, with a slight positive skew. I wonder what the relationship between fixed and volatile acidity is.

Citric acid is normally distributed around the mean of 0.3342, with a couple of extreme outliers beyond 1.1. We can see peaks at 0.5 and 0.75; I wonder if those are common amounts of citric acid added to wine. Again, I wonder what the relationship is between all the acidity-related variables.

The residual sugar distribution shows a peak around 2, and because of several extreme outliers most of the distribution sits on the left side of the histogram. To get a better look at its shape I applied a square root transformation to the x-axis. I would characterize the shape as multimodal, with several peaks and valleys. The lowest such valley occurs between 3 and 4, and the distribution drops off at around 18. There was only 1 wine with more than 45 grams/liter of sugar, which is considered sweet.
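The plotting code is not part of this report, so the following is only a sketch of why a square root axis helps. It uses the five-number summary of residual sugar from the summary output above and compares how much of the axis the middle 50% of wines occupies before and after the transformation. The helper name `iqr_share` is my own.

```r
# Five-number summary of residual.sugar, copied from the summary output above.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Fraction of the axis range occupied by the interquartile range.
iqr_share <- function(v) unname((v["q3"] - v["q1"]) / (v["max"] - v["min"]))

linear_share <- iqr_share(sugar)        # untransformed axis
sqrt_share   <- iqr_share(sqrt(sugar))  # square-root axis

print(round(c(linear = linear_share, sqrt = sqrt_share), 3))
```

On the square root scale the bulk of the distribution takes up roughly twice as much of the axis, which is what makes its shape easier to read.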

Because of the outliers in the positively skewed long tail of chlorides, I again applied a square root transformation to the x-axis to get a better sense of the shape of the bulk of the distribution. The pattern is mostly a normal distribution around the mean of 0.04577, with a bit of a positive long tail.

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   868

There are 868 wines with levels of free sulfur dioxide greater than 50, at which point it becomes evident in the nose and taste of the wine. I’ve added a derived binary variable to the data set that captures whether the wine exceeds the threshold or not. I wonder what effect on quality this might have. I’ve also added another variable: free sulfur dioxide as a percentage of total sulfur dioxide. Perhaps the balance of free and bound forms of SO2 has an effect on quality?
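A minimal sketch of how the two derived variables could be built, using only the first ten rows shown in the Data Structure output. The column names match those in the correlation matrix later in this report. The strict `> 50` comparison and the frame name `ww_head` are assumptions. The grouped summaries later in the report show this ratio taking values between 0 and 1, so it is stored here as a fraction rather than multiplied by 100.

```r
# First ten rows of the two SO2 columns, copied from the Data Structure output.
ww_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# TRUE when free SO2 exceeds the 50 ppm threshold (assumed strict comparison).
ww_head$free.sulfur.dioxide.evident <- ww_head$free.sulfur.dioxide > 50

# Free SO2 as a fraction of total SO2 (between 0 and 1).
ww_head$free.so2.pct.of.total <-
  ww_head$free.sulfur.dioxide / ww_head$total.sulfur.dioxide

print(ww_head)
```

None of these ten wines crosses the threshold, since the highest free SO2 value among them is 47.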

Total sulfur dioxide has a mostly normal distribution with some outliers beyond 250. I expect a strong correlation between total sulfur dioxide and free sulfur dioxide, as the latter is a subset of the former.

Sulphates are normally distributed with a touch of positive skew. Since sulphates can contribute to sulfur dioxide gas, I expect a strong correlation between sulphates and total sulfur dioxide.

Alcohol levels fall between 8% and just over 14% alcohol by volume, with a positively skewed distribution peaking at around 9.5.

For density, we can see that even with a square root transformation on the x-axis the outliers still cause the distribution to fall on the far left of the histogram. The shape is normal around the mean of 0.9940. I wonder whether humans can really detect such small variations in liquid density, and whether that would have an impact on quality.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 observations, each a wine described by 11 chemical properties and rated on a scale of 0 to 10 by at least 3 wine experts.

The data include the following variables:

  1. fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
  3. citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  5. chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
  6. free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. alcohol (% by volume): the percent alcohol content of the wine
  12. quality (score between 0 and 10): Output variable (based on sensory data)

Some other observations:

  * No wine was scored below 3 nor above 9; the median was 6.
  * The median alcohol content was 10.4% by volume.
  * pH levels varied between a minimum of 2.72 and a maximum of 3.82, with the median at 3.18.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this data set is quality. I’d like to know if any of the chemical properties are highly correlated to quality and could be used to predict which wines are going to be better than others.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

After doing the univariate analysis, I’m actually not quite sure which variables will have the biggest effect on quality. There are two variable clusters - acidity (pH, Fixed Acidity, Volatile Acidity, and Citric Acid) and sulfur dioxide (Free Sulfur Dioxide, Total Sulfur Dioxide, Sulphates) - whose members I expect to be strongly correlated with one another. I wonder about the impact on quality of density and alcohol level, as these aren’t necessarily taste related.

Did you create any new variables from existing variables in the dataset?

I created two new variables. The first is based on the fact that at a level of 50 ppm free sulfur dioxide becomes evident in taste; it captures whether this taste is evident or not (TRUE/FALSE). The second is free sulfur dioxide as a percentage of total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most distributions were normal, and most also had a handful of extreme outliers. The most unusual distribution was perhaps that of residual sugar, which had a large peak on the left in addition to multiple smaller peaks.

Bivariate Plots Section

##                             fixed.acidity volatile.acidity  citric.acid
## fixed.acidity                  1.00000000      -0.02269729  0.289180698
## volatile.acidity              -0.02269729       1.00000000 -0.149471811
## citric.acid                    0.28918070      -0.14947181  1.000000000
## residual.sugar                 0.08902070       0.06428606  0.094211624
## chlorides                      0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide           -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide           0.09106976       0.08926050  0.121130798
## density                        0.26533101       0.02711385  0.149502571
## pH                            -0.42585829      -0.03191537 -0.163748211
## sulphates                     -0.01714299      -0.03572815  0.062330940
## alcohol                       -0.12088112       0.06771794 -0.075728730
## quality                       -0.11366283      -0.19472297 -0.009209091
## free.sulfur.dioxide.evident   -0.02794808      -0.03070756  0.118821472
## free.so2.pct.of.total         -0.13945918      -0.19616085  0.016241396
##                             residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity                   0.08902070  0.02308564       -0.0493958591
## volatile.acidity                0.06428606  0.07051157       -0.0970119393
## citric.acid                     0.09421162  0.11436445        0.0940772210
## residual.sugar                  1.00000000  0.08868454        0.2990983537
## chlorides                       0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide             0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide            0.40143931  0.19891030        0.6155009650
## density                         0.83896645  0.25721132        0.2942104109
## pH                             -0.19413345 -0.09043946       -0.0006177961
## sulphates                      -0.02666437  0.01676288        0.0592172458
## alcohol                        -0.45063122 -0.36018871       -0.2501039415
## quality                        -0.09757683 -0.20993441        0.0081580671
## free.sulfur.dioxide.evident     0.24122018  0.09426740        0.7149309817
## free.so2.pct.of.total           0.05142979 -0.03321768        0.7386321024
##                             total.sulfur.dioxide     density            pH
## fixed.acidity                        0.091069756  0.26533101 -0.4258582910
## volatile.acidity                     0.089260504  0.02711385 -0.0319153683
## citric.acid                          0.121130798  0.14950257 -0.1637482114
## residual.sugar                       0.401439311  0.83896645 -0.1941334540
## chlorides                            0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide                  0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide                 1.000000000  0.52988132  0.0023209718
## density                              0.529881324  1.00000000 -0.0935914935
## pH                                   0.002320972 -0.09359149  1.0000000000
## sulphates                            0.134562367  0.07449315  0.1559514973
## alcohol                             -0.448892102 -0.78013762  0.1214320987
## quality                             -0.174737218 -0.30712331  0.0994272457
## free.sulfur.dioxide.evident          0.452612740  0.25679614 -0.0586619285
## free.so2.pct.of.total               -0.013447850 -0.06552475  0.0008012900
##                               sulphates     alcohol      quality
## fixed.acidity               -0.01714299 -0.12088112 -0.113662831
## volatile.acidity            -0.03572815  0.06771794 -0.194722969
## citric.acid                  0.06233094 -0.07572873 -0.009209091
## residual.sugar              -0.02666437 -0.45063122 -0.097576829
## chlorides                    0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide          0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide         0.13456237 -0.44889210 -0.174737218
## density                      0.07449315 -0.78013762 -0.307123313
## pH                           0.15595150  0.12143210  0.099427246
## sulphates                    1.00000000 -0.01743277  0.053677877
## alcohol                     -0.01743277  1.00000000  0.435574715
## quality                      0.05367788  0.43557472  1.000000000
## free.sulfur.dioxide.evident  0.04096416 -0.24623432 -0.090581598
## free.so2.pct.of.total       -0.02236186  0.06446642  0.197214077
##                             free.sulfur.dioxide.evident
## fixed.acidity                               -0.02794808
## volatile.acidity                            -0.03070756
## citric.acid                                  0.11882147
## residual.sugar                               0.24122018
## chlorides                                    0.09426740
## free.sulfur.dioxide                          0.71493098
## total.sulfur.dioxide                         0.45261274
## density                                      0.25679614
## pH                                          -0.05866193
## sulphates                                    0.04096416
## alcohol                                     -0.24623432
## quality                                     -0.09058160
## free.sulfur.dioxide.evident                  1.00000000
## free.so2.pct.of.total                        0.46986612
##                             free.so2.pct.of.total
## fixed.acidity                         -0.13945918
## volatile.acidity                      -0.19616085
## citric.acid                            0.01624140
## residual.sugar                         0.05142979
## chlorides                             -0.03321768
## free.sulfur.dioxide                    0.73863210
## total.sulfur.dioxide                  -0.01344785
## density                               -0.06552475
## pH                                     0.00080129
## sulphates                             -0.02236186
## alcohol                                0.06446642
## quality                                0.19721408
## free.sulfur.dioxide.evident            0.46986612
## free.so2.pct.of.total                  1.00000000

Some initially surprising correlations are found: quality has a moderate positive correlation with alcohol (.44) and a weak negative correlation with density (-.31).

Let’s look at these a bit closer.

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

In the scatter plots we can see that as quality increases, levels of alcohol tend to be higher, as also shown by the linear smoothing line. This becomes even more apparent when studying the boxplot and summarizing median alcohol levels per quality rank. Wines rated 7 and above have a median alcohol level of 11.4 or higher, while wines rated 6 and below have a median alcohol level of between 9.5 and 10.5.
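The split between lower- and higher-rated wines can be checked directly from the grouped summaries. This sketch hard-codes the medians printed above; the variable names are my own.

```r
# Median alcohol (% by volume) per quality score, copied from the output above.
median_alcohol <- c("3" = 10.45, "4" = 10.10, "5" = 9.50, "6" = 10.50,
                    "7" = 11.40, "8" = 12.00, "9" = 12.50)

scores <- as.integer(names(median_alcohol))

# Range of medians for lower-rated (<= 6) and higher-rated (>= 7) wines.
low_range  <- range(median_alcohol[scores <= 6])
high_range <- range(median_alcohol[scores >= 7])

print(low_range)
print(high_range)
```

With the full data frame loaded as `ww` (the name that appears in the output above), a call such as `tapply(ww$alcohol, ww$quality, median)` would be one way to obtain these medians in base R.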

Let’s take a look at quality vs. density.

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

When plotting density against quality, we can see the negative correlation visually. It appears that as density increases, quality decreases. This trend is particularly noticeable at grades 7, 8, and 9. When we look at the boxplot of the data, we can indeed see that the medians for 7, 8, and 9 are below those of the lower grades, which have medians of between .9937 and .9953. The higher quality wines have median densities of between .9903 and .9918.

This is surprising! I wouldn’t have guessed that density would have been one of the more well correlated variables. However, we know that “the density of wine is close to that of water depending on the percent alcohol and sugar content”. And in fact this is exactly what the data bears out.

Let’s take a closer look at density. It has a strong positive correlation with residual sugar (.84, the strongest correlation found between all the variables) and a strong negative correlation with alcohol (-.78).

We can see a clear relationship between density and residual sugar: as residual sugar increases, so does density. As the description of the dataset indicated, residual sugars do indeed rarely go lower than 1, as indicated by the dotted red line. We also notice, as hinted at by the histogram of residual sugar, that a large cluster of wines have sugar levels between 1 and 2.

We clearly see that as the alcohol levels increase, density decreases. Let’s see what the relationship looks like between alcohol and sugars.

As expected, alcohol and sugar have a negative correlation: the more sugar is left after fermentation, the less alcoholic the wine. I assume this is because the sugar has not been converted to alcohol, and therefore the wine is less alcoholic and sweeter.
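As a small illustration of how such a coefficient is computed, the sketch below applies base R's `cor()` to the first ten rows shown in the Data Structure output. Ten rows are not representative of the 4,898 wines, so only the sign should be read from it. The full-data value is the one in the correlation matrix above.

```r
# First ten rows of two columns, copied from the Data Structure output.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Pearson correlation (the default method of cor()).
r_sample <- cor(residual.sugar, alcohol)
print(round(r_sample, 2))
```

Even in this tiny sample the coefficient is negative, in line with the direction seen in the full data.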

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04587 0.13950 0.19860 0.25740 0.29930 0.65680 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03371 0.10460 0.15730 0.18040 0.23540 0.58140 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02362 0.17190 0.23810 0.23770 0.29650 0.65000 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03361 0.19840 0.25860 0.26220 0.32050 0.71050 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.2118  0.2717  0.2757  0.3333  0.6429 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07895 0.22310 0.28830 0.28920 0.33630 0.60380 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1942  0.2258  0.2743  0.2911  0.2824  0.4790

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0   105.8   159.5   170.6   210.0   440.0 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.0   117.0   125.3   171.5   272.0 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   121.0   151.0   150.9   182.0   344.0 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0   107.2   132.0   137.0   164.0   294.0 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.1   144.2   229.0 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0   102.5   122.0   126.2   150.0   212.5 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      85     113     119     116     124     139

The relationship between the presence of sulfur dioxide (SO2) and the quality of wine is a bit murky. If we consider free SO2 as a percentage of total SO2, we find a weak positive correlation (.20), and for total SO2 we find a weak negative correlation (-.17). In other words, the less SO2 the better, and the less of it that is bound rather than free, the better for wine quality.

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

Among the acidity-related features, volatile acidity has the strongest correlation with quality, albeit a weak negative one (-.19): the more volatile acidity is present, the lower the quality. As was mentioned in the feature description, with too much of this acidity the wine begins to take on a vinegar taste.

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

Finally, chlorides are also weakly negatively correlated with quality (-.21); the more chlorides are in the wine, the worse the quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Among the original features, the following had the strongest correlation with quality, in descending order of absolute strength: alcohol (.44), density (-.31), chlorides (-.21), volatile acidity (-.19), and total sulfur dioxide (-.17). I was surprised both that alcohol had the strongest correlation (I wouldn’t think this alone would say anything about quality) and that residual sugar had so little correlation (-.10), since it was so strongly correlated with both alcohol and density. I had also expected either the presence of sulfur dioxide or acidity to have a greater correlation, but as it stands each only has a weak correlation with quality. I wonder if together these features could build a robust linear regression model with good predictive power.
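A ranking by absolute correlation can be reproduced from the quality column of the correlation matrix printed in the Bivariate Plots Section. The sketch below hard-codes that column, rounded to four decimals, instead of recomputing it. The variable names `cor_with_quality` and `ranked` are my own.

```r
# Correlation of each variable with quality, copied from the "quality" column
# of the correlation matrix (rounded to four decimals).
cor_with_quality <- c(
  fixed.acidity = -0.1137, volatile.acidity = -0.1947, citric.acid = -0.0092,
  residual.sugar = -0.0976, chlorides = -0.2099, free.sulfur.dioxide = 0.0082,
  total.sulfur.dioxide = -0.1747, density = -0.3071, pH = 0.0994,
  sulphates = 0.0537, alcohol = 0.4356,
  free.sulfur.dioxide.evident = -0.0906, free.so2.pct.of.total = 0.1972
)

# Order by absolute strength, strongest first.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting on the absolute value keeps the sign visible in the printed result while still placing strong negative correlations, such as density, near the top.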

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was able to confirm the relationships between some features as they were described. For example, the density of wine was strongly correlated with both alcohol and residual sugar.

What was the strongest relationship you found?

The strongest correlation I found was between residual sugar and density at .84. The more sugar is left after fermentation, the higher the density of the wine.

Multivariate Plots Section

Exploring the relationship between sugar, density, and alcohol a bit further, we can see the three features interact in the above plot. What we see is that the variation in density as sugar levels increase is explained neatly by the alcohol content: the higher the alcohol content, the lower the density, at all points on the sugar level spectrum.

Studying the effect of alcohol and density on quality in the grid of plots above, we notice that the distribution of wines shifts from top left (lower alcohol, higher density) to bottom right (higher alcohol, lower density) as quality increases.

Taking a look at the same plot grid but with sugar instead, we notice that as quality increases, sugar levels drop.

Studying the effect of total sulfur dioxide, we see that the weight of the distribution shifts from right (more SO2) to left (less SO2), indicating again that quality goes down with increasing levels of sulfur dioxide.

Finally, taking a look at the interaction of some of the acidity features, we find that citric acid and fixed acidity have a weak positive correlation (.29). We also see several bands along citric acid values of .5 and .75, corresponding to the peaks we saw in our citric acid histogram. The pH colors of the plot reveal, unsurprisingly, that the more acidic the wine, the lower the pH level.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Through multivariate analysis we were able to underline the important relationships between levels of alcohol, residual sugar, and density, and their effect on the quality of the wine. Wines that are less dense, more alcoholic, and have less sugar tend to be rated higher. We were also able to verify again that higher levels of total sulfur dioxide tend to go with lower wine quality.

Were there any interesting or surprising interactions between features?

The only surprise was that there weren’t stronger correlations with wine quality. Sulphates generally had no noticeable impact. The SO2 and acidity features had weak effects on the level of quality. No unusual relationships were found that hadn’t already been hinted at in the feature descriptions.


Final Plots and Summary

Plot One

Description One

Fig. 1 shows the clear relationship between residual sugar and density, the two features with the strongest correlation out of all variables, specifically a strong positive correlation of .84. We know that alcohol has a lower density than water, and that residual sugar is sugar that hasn’t been converted to alcohol during the fermentation process. Therefore, the more sugar is converted, the more alcohol is present and thus the lower the density.

As the description of the dataset indicated, residual sugars do indeed rarely go lower than 1, as indicated by the dotted black line. It appears that there is a natural barrier beyond which it is very difficult to continue fermenting any remaining sugar. This “wall” is shown by the large cluster of wines that have sugar levels between 1 and 2, which were hinted at by the histogram of residual sugar. In fact nearly 30% of wines have residual sugar content of less than 2 g/L.

The median residual sugar content (red dotted line) is 5.2. The median density is 0.99374 (purple dotted line).
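The three kinds of summary numbers quoted for Fig. 1 (a Pearson correlation, the share of wines below a sugar threshold, and a median) are straightforward to compute. The following is a minimal sketch in plain Python, not the code that produced the figures. The helper names `pearson` and `share_below` are my own, and the five sample rows are simply the first five rounded values printed in the data structure output, so the results will not match the full-dataset figures of .83, 30%, and 5.2.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# First five rows as printed (rounded) in the structure output;
# far too few rows to reproduce the reported statistics.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]

r = pearson(residual_sugar, density)
share = share_below(residual_sugar, 2.0)
med = median(residual_sugar)

print(f"correlation: {r:.2f}")
print(f"share below 2 g/L: {share:.0%}")
print(f"median residual sugar: {med}")
```

Run against the full 4898 rows instead of this sample, the same three calls would be expected to give the values reported in the text.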

Plot Two

Description Two

Building on our previous exploration of the relationship between sugar, density, and alcohol (Fig. 1), we can see the three features interact in Fig. 2. What we see is that the variation in density as sugar levels increase is explained neatly by the alcohol content: the higher the alcohol content, the lower the density, at all points on the sugar level spectrum. This makes sense as alcohol is naturally less dense than water.

We also notice that residual sugar and alcohol do indeed have a moderate negative correlation (specifically -.41): the higher the residual sugar content (in other words, the higher the level of sugar that hasn’t been converted to alcohol during the fermentation process) the lower the alcohol content. Of course what is not answered by the data is how much sugar went into the fermentation process to begin with; we can imagine that some wines have more sugar converted to alcohol than others but have the same residual sugar levels. Fig. 2 would seem to support this idea.

The median alcohol content is 10.4. In Fig. 2 the purple dotted line shows the median density of wine (.9944) at that alcohol level.

Plot Three

Description Three

In Fig. 3 we see the interaction of total sulfur dioxide and density on quality: the weight of the distribution shifts from right (more SO2) to left (less SO2), indicating that quality goes up as both total sulfur dioxide and density decrease.

The median Total SO2 level is 134. At a quality of 5 the median level is 151, at 6 the median level is 132, and at 7 the median level is 122. Across the same quality levels we also notice that the median density levels drop from .9953 to .9937 to .9918. These relationships are also captured in the measures of correlation between each: density and total SO2 have a moderate positive correlation of .54, while quality has a correlation with density and total SO2 of -.29 and -.17 respectively. In other words, the less dense the wine and the lower the level of total SO2, the higher the quality tends to be.
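The per-quality medians above come from grouping the wines by quality score and taking the median within each group. A minimal sketch of that grouping in plain Python follows. The function name `median_by_group` is my own, and the rows are hypothetical values invented purely for illustration; they are not taken from the dataset and will not reproduce the medians reported above.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Median of value_key within each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: median(vals) for g, vals in sorted(groups.items())}


# Hypothetical rows for illustration only (NOT from the wine dataset).
toy_rows = [
    {"quality": 5, "total.sulfur.dioxide": 160.0},
    {"quality": 5, "total.sulfur.dioxide": 150.0},
    {"quality": 5, "total.sulfur.dioxide": 140.0},
    {"quality": 6, "total.sulfur.dioxide": 130.0},
    {"quality": 6, "total.sulfur.dioxide": 136.0},
    {"quality": 7, "total.sulfur.dioxide": 120.0},
]

medians = median_by_group(toy_rows, "quality", "total.sulfur.dioxide")
for quality, value in medians.items():
    print(quality, value)
```

Passing `"density"` as the value key would give the corresponding per-quality density medians in the same way.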

We also observe a relationship between Free SO2 and Total SO2: unsurprisingly, the less Free SO2 there is, the lower the Total SO2 as well. These two features have a moderate to strong positive correlation of .6.


Reflection

I started my investigation of nearly 5000 white wines by studying the description of the various features. In them lay some hints about the relationship of the variables that I was able to confirm over the course of the analysis. Without any real domain knowledge, I was expecting to find that the features describing various levels of acidity, the presence of sulfur dioxide, and chlorides would have the greatest impact on the level of quality. In the end, however, I was surprised to find out that it was really the density of the wine and the relationship between density, alcohol, and residual sugar that had the greatest effect on the quality of the wine.

In fact, the most difficult part of the analysis was trying to find meaningful relationships among more than two variables, especially with regard to quality. Many of the features only had a weak correlation with quality and thus the trends were a bit hard to discern in the visualizations.

One lingering question I’m left with is the presence of the bands when analyzing the citric acid feature. I wonder why there are bands around .5 and .75 and whether this has anything to do with the creation process. In general, even deeper analysis would surely benefit from further study into the craft of wine-making and what exactly takes place.

I was happy to validate some relationships that had been mentioned in the feature descriptions, namely the interaction between density, alcohol, and residual sugar, as well as the fact that as volatile acidity rises (and the wine starts to taste like vinegar) the quality decreases.

More broadly speaking, it was a valuable exercise in diving into a dataset without any prior knowledge and getting to know the ins and outs through exploration. The clear next step would be to start to develop predictive models that could guess the quality of the wine depending on the values of various features. It seems like a great data set to try out various models like linear regression, random forests, neural networks, etc.
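As a hint of what the simplest such model looks like, here is a minimal sketch of a one-predictor least-squares regression in plain Python. It is an illustration of the mechanics only, not a model fitted in this analysis: the function name `fit_line` is my own, and the (alcohol, quality) pairs are hypothetical values made up for the example, not rows from the dataset.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Hypothetical (alcohol, quality) pairs for illustration only.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 6, 6, 7]

intercept, slope = fit_line(alcohol, quality)
predicted = intercept + slope * 11.0  # predicted quality at 11% alcohol

print(f"quality ~ {intercept:.2f} + {slope:.2f} * alcohol")
```

A real model would use several features at once (for example alcohol, density, and residual sugar) and would be evaluated on held-out wines rather than on the data it was fitted to.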

Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib